Feature/unswizzle #2732

Open
int-smart wants to merge 6 commits into NVIDIA:main from int-smart:feature/unswizzle

@int-smart

Description

This PR adds unswizzle support for scaling factors and extends the swizzle module so scaling tensors can be converted from GEMM-swizzled layout back to compact layout, including multi-tensor paths. It also adds round-trip and standalone tests to validate unswizzle correctness.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added unswizzle APIs and implementation in transformer_engine/common/swizzle/swizzle.cu and declarations in transformer_engine/common/include/transformer_engine/swizzle.h
  • Added multi-tensor unswizzle support with swizzle-like validation assumptions (homogeneous scaling mode/layout, swizzled input and compact output expectations)
  • Refactored multi-tensor unswizzle launch/kernels to mirror swizzle structure (split row-wise and column-wise kernels) for easier readability
  • Added/extended tests in tests/cpp/operator/test_swizzle.cu, including standalone unswizzle and swizzle→unswizzle round-trip coverage
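The swizzle→unswizzle round-trip coverage listed above can be illustrated with a small CPU reference. This is a minimal sketch, not the code in test_swizzle.cu. It assumes the 128x4 scale-factor tile layout described in the cuBLAS block-scaling documentation (offset within a tile = (row % 32) * 16 + (row / 32) * 4 + col), assumes tiles are stored row-major over the tile grid, and ignores padding. The names swizzled_index, swizzle_ref and unswizzle_ref are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tile shape: 128 rows x 4 columns of one-byte scale factors.
constexpr std::size_t kTileM = 128;
constexpr std::size_t kTileK = 4;

// Map (row, col) of a compact row-major M x K matrix to its swizzled offset.
// M and K are assumed to be multiples of the tile shape (no padding handled).
std::size_t swizzled_index(std::size_t row, std::size_t col, std::size_t K) {
  const std::size_t tiles_per_row = K / kTileK;
  const std::size_t tile = (row / kTileM) * tiles_per_row + (col / kTileK);
  const std::size_t r = row % kTileM;
  const std::size_t c = col % kTileK;
  const std::size_t in_tile = (r % 32) * 16 + (r / 32) * 4 + c;
  return tile * (kTileM * kTileK) + in_tile;
}

// Compact -> swizzled: scatter each element to its swizzled offset.
std::vector<std::uint8_t> swizzle_ref(const std::vector<std::uint8_t>& compact,
                                      std::size_t M, std::size_t K) {
  std::vector<std::uint8_t> out(compact.size());
  for (std::size_t r = 0; r < M; ++r)
    for (std::size_t c = 0; c < K; ++c)
      out[swizzled_index(r, c, K)] = compact[r * K + c];
  return out;
}

// Swizzled -> compact: gather each element back from its swizzled offset.
std::vector<std::uint8_t> unswizzle_ref(const std::vector<std::uint8_t>& swizzled,
                                        std::size_t M, std::size_t K) {
  std::vector<std::uint8_t> out(swizzled.size());
  for (std::size_t r = 0; r < M; ++r)
    for (std::size_t c = 0; c < K; ++c)
      out[r * K + c] = swizzled[swizzled_index(r, c, K)];
  return out;
}
```

Under these assumptions, a round-trip check amounts to asserting that unswizzle_ref(swizzle_ref(x, M, K), M, K) equals x for a buffer x whose bytes are distinguishable from each other.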

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

int-smart and others added 6 commits March 3, 2026 20:40
- Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
- Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
- Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
These enhancements test the changes introduced for unswizzling.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
- Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
- Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
- Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
- Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
- Updated the launch function to handle multiple tensor unswizzling operations efficiently.

These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds unswizzle support for MXFP8 and NVFP4 scaling factors, providing the inverse operation to the existing nvte_swizzle_scaling_factors API. It introduces nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors, along with the corresponding GPU kernels (unswizzle_row_scaling_kernel_impl, unswizzle_col_scaling_kernel_impl) and byte-level inverse shuffle helpers (regs_unshuffle, regs_unshuffle_with_bit_shifts). Tests covering standalone unswizzle and swizzle→unswizzle round-trips are also added.

Key observations:

  • The two new inverse shuffle functions (regs_unshuffle and regs_unshuffle_with_bit_shifts) are mathematically correct inverses of their forward counterparts for all supported LType variants (int, int2, int4).
  • Bug in test helpers: In both performTestUnswizzle1D and performTestSwizzleUnswizzleRoundtrip, the variables SF_MODE_X and SF_MODE_Y are uninitialized when the !(rowwise || columnwise) branch of the skip guard is taken, causing undefined behaviour in the GTEST_SKIP message.
  • The rowwise_swizzle/columnwise_swizzle variable names in multi_tensor_unswizzle_scaling_factors should be named rowwise_unswizzle/columnwise_unswizzle to avoid confusion about data-flow direction.
  • Minor: both skip messages are missing a space before "is not implemented.", and the regs_unshuffle_with_bit_shifts function body is not followed by a blank line before the next template.
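The claim that the unshuffle helpers are exact inverses of the forward shuffles can be checked mechanically by tagging every byte with a unique value and composing the two functions. The sketch below shows that technique on a toy byte permutation over four 32-bit registers. toy_shuffle and toy_unshuffle are hypothetical stand-ins, not the permutation implemented by regs_shuffle / regs_unshuffle in swizzle.cu.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Regs = std::array<std::uint32_t, 4>;  // four 32-bit registers = 16 bytes

// Read byte j (0 = least significant) of register i.
inline std::uint32_t get_byte(const Regs& r, int i, int j) {
  return (r[i] >> (8 * j)) & 0xFFu;
}

// Toy forward shuffle (an assumption for illustration only):
// output byte (i, j) takes input byte (j, (i + j) % 4).
Regs toy_shuffle(const Regs& in) {
  Regs out{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      out[i] |= get_byte(in, j, (i + j) % 4) << (8 * j);
  return out;
}

// Inverse of toy_shuffle: original byte (a, b) was written to
// shuffled byte ((b - a) mod 4, a), so read it back from there.
Regs toy_unshuffle(const Regs& in) {
  Regs out{};
  for (int a = 0; a < 4; ++a)
    for (int b = 0; b < 4; ++b)
      out[a] |= get_byte(in, (b - a + 4) % 4, a) << (8 * b);
  return out;
}

// Give every byte a unique tag so any misplaced byte is detected.
Regs tagged_input() {
  Regs r{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      r[i] |= static_cast<std::uint32_t>(4 * i + j + 1) << (8 * j);
  return r;
}
```

With a tagged input x, the inverse property holds when toy_unshuffle(toy_shuffle(x)) == x. The same style of check could be applied per LType variant to the real helpers.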

Confidence Score: 4/5

  • Production kernel logic is sound; two UB issues in test helpers should be fixed before merging.
  • The core GPU kernels and inverse shuffle helpers are mathematically correct and mirror the existing swizzle structure. The only defects found are confined to test code: two instances of reading uninitialized variables in GTEST_SKIP messages (undefined behaviour, though unlikely to cause silent data corruption in production). The remaining issues are cosmetic naming/style concerns. No functional regression risks exist in the shipped library code.
  • tests/cpp/operator/test_swizzle.cu — fix uninitialized SF_MODE_X/SF_MODE_Y before merging.

Important Files Changed

Filename | Overview
transformer_engine/common/swizzle/swizzle.cu | Adds regs_unshuffle_with_bit_shifts and regs_unshuffle (verified correct inverses of their forward counterparts), two new GPU kernel impl functions for row- and column-wise unswizzle, a unified dispatch kernel, multi-tensor row/col unswizzle kernels, and the full unswizzle_scaling_factors / multi_tensor_unswizzle_scaling_factors host-side logic. Variable names rowwise_swizzle/columnwise_swizzle in the multi-tensor unswizzle path are misleadingly named; a missing blank line exists after regs_unshuffle_with_bit_shifts.
transformer_engine/common/include/transformer_engine/swizzle.h | Adds declarations for nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors with complete Doxygen comments that accurately describe inputs, outputs, and requirements. No issues found.
tests/cpp/operator/test_swizzle.cu | Adds compute_ref_unswizzle, performTestUnswizzle1D, performTestSwizzleUnswizzleRoundtrip, and corresponding GTest suites/instantiations. Two separate instances of undefined behaviour: SF_MODE_X/SF_MODE_Y are read uninitialized in the GTEST_SKIP message when neither rowwise nor columnwise is set (lines 155-157 and 296-298). Also has missing spaces in both skip messages.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant nvte_unswizzle_scaling_factors
    participant unswizzle_scaling_factors
    participant unswizzle_scaling_kernel
    participant unswizzle_row_impl as unswizzle_row_scaling_kernel_impl
    participant unswizzle_col_impl as unswizzle_col_scaling_kernel_impl

    Caller->>nvte_unswizzle_scaling_factors: input (swizzled), output (compact), stream
    nvte_unswizzle_scaling_factors->>unswizzle_scaling_factors: convertNVTETensorCheck()
    unswizzle_scaling_factors->>unswizzle_scaling_factors: validate scaling_mode, dtype, shapes
    alt rowwise_unswizzle
        unswizzle_scaling_factors->>unswizzle_scaling_kernel: launch<<<grid,block,slm,stream>>>
        unswizzle_scaling_kernel->>unswizzle_row_impl: row_scaling=true
        unswizzle_row_impl->>unswizzle_row_impl: load tiles to SLM
        unswizzle_row_impl->>unswizzle_row_impl: regs_unshuffle()
        unswizzle_row_impl->>unswizzle_row_impl: write compact output
    else columnwise_unswizzle
        unswizzle_scaling_factors->>unswizzle_scaling_kernel: launch<<<grid,block,slm,stream>>>
        unswizzle_scaling_kernel->>unswizzle_col_impl: row_scaling=false
        unswizzle_col_impl->>unswizzle_col_impl: load tiles to SLM
        unswizzle_col_impl->>unswizzle_col_impl: regs_unshuffle_with_bit_shifts()
        unswizzle_col_impl->>unswizzle_col_impl: write compact output
    end
    unswizzle_scaling_factors-->>Caller: compact scale_inv in output

Last reviewed commit: 621bc16

Comment on lines +155 to +158
if ((rowwise && columnwise) || !(rowwise || columnwise)){
GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
std::to_string(SF_MODE_Y) + "is not implemented.";
}

Uninitialized variables used in skip message

When !(rowwise || columnwise) is true (neither flag is set), neither if (rowwise) nor if (columnwise) branch executes, leaving SF_MODE_X and SF_MODE_Y uninitialized. Passing them to std::to_string() is undefined behaviour.

The same issue exists in performTestSwizzleUnswizzleRoundtrip at line 297.

Suggested change
- if ((rowwise && columnwise) || !(rowwise || columnwise)){
- GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
- std::to_string(SF_MODE_Y) + "is not implemented.";
- }
+ if ((rowwise && columnwise) || !(rowwise || columnwise)){
+ GTEST_SKIP() << "TEST SKIPPED, The scaling mode is not implemented.";
+ }
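An alternative to dropping the values from the message is to give the two variables defined defaults so that every path reads an initialized value. The sketch below shows that pattern without GoogleTest. skip_reason is a hypothetical helper, and the 1x32 / 32x1 values are assumptions used only for illustration.

```cpp
#include <cassert>
#include <string>

// Returns an empty string when the configuration is supported,
// otherwise a message describing why the test would be skipped.
std::string skip_reason(bool rowwise, bool columnwise) {
  // Initialized up front, so reading them is well-defined on every path.
  int sf_mode_x = 0;
  int sf_mode_y = 0;
  if (rowwise) {
    sf_mode_x = 1;   // assumed value for illustration
    sf_mode_y = 32;  // assumed value for illustration
  }
  if (columnwise) {
    sf_mode_x = 32;  // assumed value for illustration
    sf_mode_y = 1;   // assumed value for illustration
  }
  // Same guard as in the review comment: both flags set, or neither.
  const bool unsupported = (rowwise && columnwise) || !(rowwise || columnwise);
  if (!unsupported) return "";
  // Note the leading space before "is not implemented."
  return "TEST SKIPPED, The scaling mode " + std::to_string(sf_mode_x) + "x" +
         std::to_string(sf_mode_y) + " is not implemented.";
}
```

In a test body, the returned string could be streamed into GTEST_SKIP() when it is non-empty. Whether to keep the numeric values or drop them as suggested above is a style decision for the PR author.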

Comment on lines +296 to +299
if ((rowwise && columnwise) || !(rowwise || columnwise)){
GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
std::to_string(SF_MODE_Y) + "is not implemented.";
}

Uninitialized variables used in skip message (roundtrip test)

Same undefined-behaviour issue as in performTestUnswizzle1D: SF_MODE_X and SF_MODE_Y are uninitialized when !(rowwise || columnwise).

Suggested change
- if ((rowwise && columnwise) || !(rowwise || columnwise)){
- GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
- std::to_string(SF_MODE_Y) + "is not implemented.";
- }
+ if ((rowwise && columnwise) || !(rowwise || columnwise)){
+ GTEST_SKIP() << "TEST SKIPPED, The scaling mode is not implemented.";
+ }


if ((rowwise && columnwise) || !(rowwise || columnwise)){
GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
std::to_string(SF_MODE_Y) + "is not implemented.";

Missing space in skip message

The concatenated string produces "...32is not implemented." (no space before "is"). Add a leading space.

Suggested change
- std::to_string(SF_MODE_Y) + "is not implemented.";
+ std::to_string(SF_MODE_Y) + " is not implemented.";


if ((rowwise && columnwise) || !(rowwise || columnwise)){
GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
std::to_string(SF_MODE_Y) + "is not implemented.";

Missing space in skip message (roundtrip test)

Same missing space issue — produces "...32is not implemented." without a space.

Suggested change
- std::to_string(SF_MODE_Y) + "is not implemented.";
+ std::to_string(SF_MODE_Y) + " is not implemented.";

Comment on lines +80 to 82
for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
}
template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>

Missing blank line after function definition

regs_unshuffle_with_bit_shifts ends and the next template declaration begins immediately (no blank line). Every other function pair in this file is separated by a blank line. Add one for consistency.

Suggested change
- for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
- }
- template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>
+ for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
+ }
+
+ template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +1428 to +1429
const bool rowwise_swizzle = all_has_data || all_nvfp4;
const bool columnwise_swizzle = all_has_columnwise_data && !all_nvfp4;

Misleading variable names in unswizzle function

rowwise_swizzle and columnwise_swizzle are declared inside multi_tensor_unswizzle_scaling_factors but refer to unswizzle operations, not swizzle. This can confuse future readers about the data-flow direction. Consider renaming to rowwise_unswizzle / columnwise_unswizzle to match the function's purpose.

Suggested change
- const bool rowwise_swizzle = all_has_data || all_nvfp4;
- const bool columnwise_swizzle = all_has_columnwise_data && !all_nvfp4;
+ const bool rowwise_unswizzle = all_has_data || all_nvfp4;
+ const bool columnwise_unswizzle = all_has_columnwise_data && !all_nvfp4;
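To show how the renamed flags read in context, here is a small host-side sketch of the flag derivation. Only the two final expressions come from the diff above. The TensorInfo struct, the derive_unswizzle_flags helper, and the way the all_* predicates are accumulated are assumptions made for illustration and may not match multi_tensor_unswizzle_scaling_factors.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-tensor summary; the real code inspects Tensor objects.
struct TensorInfo {
  bool has_data;
  bool has_columnwise_data;
  bool is_nvfp4;
};

struct UnswizzleFlags {
  bool rowwise_unswizzle;
  bool columnwise_unswizzle;
};

UnswizzleFlags derive_unswizzle_flags(const std::vector<TensorInfo>& tensors) {
  // Assumed meaning of the all_* names: the property holds for every tensor.
  bool all_has_data = true;
  bool all_has_columnwise_data = true;
  bool all_nvfp4 = true;
  for (const TensorInfo& t : tensors) {
    all_has_data = all_has_data && t.has_data;
    all_has_columnwise_data = all_has_columnwise_data && t.has_columnwise_data;
    all_nvfp4 = all_nvfp4 && t.is_nvfp4;
  }
  // Expressions taken from the suggested change, using the unswizzle names.
  const bool rowwise_unswizzle = all_has_data || all_nvfp4;
  const bool columnwise_unswizzle = all_has_columnwise_data && !all_nvfp4;
  return {rowwise_unswizzle, columnwise_unswizzle};
}
```

Naming the flags after the operation actually performed keeps later branches such as "if (rowwise_unswizzle)" self-explanatory inside the unswizzle function.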

@vthumbe1503 vthumbe1503 added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Mar 4, 2026


2 participants